(54) View offset estimation for stereoscopic video coding



(57) In a stereoscopic video transmission system, 
where an enhancement layer image is disparity predicted
using a lower layer image, the lower layer image
is made to more closely match the enhancement layer 
image by shifting the lower layer image to the right to 
compensate for inter-ocular camera lens separation. 
The motion vector search range for disparity prediction 
is reduced to improve coding efficiency. At an encoder, 
the optimal offset, x, between the enhancement layer 
image and the lower layer image is determined accord- 
ing to either a minimum mean error or a minimum mean 
squared error between the enhancement and lower
layer images. The offset x is bounded by an offset
search range X. The x rightmost pixel columns of the 
lower layer image are deleted, and the x leftmost col- 
umns of the lower layer image are padded to effectively 
shift the lower layer image to the right by x pixels to 
obtain the reference image for use in disparity predict- 
ing the enhancement layer image. For arbitrarily shaped 
images such as VOPs within a frame, the leftmost por- 
tion is deleted and the rightmost portion is padded. At a 
decoder, the offset value x is recovered if available and 
used to reconstruct the reference frame. 
Description

BACKGROUND OF THE INVENTION 

[0001] The present invention relates to an apparatus and method for coding stereoscopic video data. In particular, a
system for estimating the optimal offset of a scene between right and left channel views at the same temporal reference 
point is presented. The system reduces the motion vector search range for disparity (i.e., cross-channel or cross-layer) 
prediction to improve coding efficiency. 

[0002] Digital technology has revolutionized the delivery of video and audio services to consumers since it can deliver 
signals of much higher quality than analog techniques and provide additional features that were previously unavailable.
Digital systems are particularly advantageous for signals that are broadcast via a cable television network or by satellite 
to cable television affiliates and/or directly to home satellite television receivers. In such systems, a subscriber receives 
the digital data stream via a receiver/descrambler that decompresses and decodes the data in order to reconstruct the 
original video and audio signals. The digital receiver includes a microcomputer and memory storage elements for use 
in this process.

[0003] The need to provide low cost receivers while still providing high quality video and audio requires that the 
amount of data which is processed be limited. Moreover, the available bandwidth for the transmission of the digital sig- 
nal may also be limited by physical constraints, existing communication protocols, and governmental regulations. 
Accordingly, various intra-frame data compression schemes have been developed that take advantage of the spatial
correlation among adjacent pixels in a particular video picture (e.g., frame).

[0004] Moreover, inter-frame compression schemes take advantage of temporal correlations between corresponding 
regions of successive frames by using motion compensation data and block-matching motion estimation algorithms. In 
this case, a motion vector is determined for each block in a current picture of an image by identifying a block in a previous
picture which most closely resembles the current block. The entire current picture can then be reconstructed at a
decoder by sending data which represents the difference between the corresponding block pairs, together with the
motion vectors that are required to identify the corresponding pairs. Block matching motion estimating algorithms are
particularly effective when combined with block-based spatial compression techniques such as the discrete cosine 
transform (DCT). 

[0005] Additionally, there has been increasing interest in proposed stereoscopic video transmission formats such as 
the Motion Picture Experts Group (MPEG) MPEG-2 Multi-view Profile (MVP) system, described in document ISO/IEC
JTC1/SC29/WG11 N1088 (ITU-T Recommendation H.262), entitled "Proposed Draft Amendment No. 3 to 13818-2
(Multi-view Profile)," November 1995, and its amendment 3; as well as the MPEG-4 Video Verification Model (VM) Version
3.0, described in document ISO/IEC JTC1/SC29/WG11 N1277, Tampere, Finland, July 1996, both of which are
incorporated herein by reference. 
[0006] Stereoscopic video provides slightly offset views of the same image to produce a combined image with greater
depth of field, thereby creating a three-dimensional (3-D) effect. In such a system, dual cameras may be positioned
about 2.5 inches, or 65 mm, apart to record an event on two separate video signals. The spacing of the cameras
approximates the distance between left and right human eyes, i.e., the inter-ocular separation. Moreover, with some
stereoscopic video camcorders, the two lenses are built into one camcorder head and therefore move in synchronism,
for example, when panning across an image. The two video signals can be transmitted and recombined at a receiver to
produce an image with a depth of field that corresponds to normal human vision. Other special effects can also be pro- 
vided. 

[0007] The MPEG MVP system includes two video layers which are transmitted in a multiplexed signal. First, a base 
(e.g., lower) layer represents a left view of a three dimensional object. Second, an enhancement (e.g., auxiliary, or
upper) layer represents a right view of the object. Since the right and left views are of the same object and are offset
only slightly relative to each other, there will usually be a large degree of correlation between the video images of the 
base and enhancement layers. This correlation can be used to compress the enhancement layer data relative to the 
base layer, thereby reducing the amount of data that needs to be transmitted in the enhancement layer to maintain a 
given image quality. The image quality generally corresponds to the quantization level of the video data. 

[0008] The MPEG MVP system includes three types of video pictures; specifically, the intra-coded picture (I-picture),
predictive-coded picture (P-picture), and bi-directionally predictive-coded picture (B-picture). Furthermore, while the
base layer accommodates either frame or field structure video sequences, the enhancement layer accommodates only
frame structure. An I-picture completely describes a single video picture without reference to any other picture. For
improved error concealment, motion vectors can be included with an I-picture. An error in an I-picture has the potential
for greater impact on the displayed video since both P-pictures and B-pictures in the base layer are predicted from
I-pictures. Moreover, pictures in the enhancement layer can be predicted from pictures in the base layer in a cross-layer
prediction process known as disparity prediction. Prediction from one frame to another within a layer is known as tem- 
poral prediction. 
[0009] In the base layer, P pictures are predicted based on previous I or P pictures. The reference is from an earlier
I or P picture to a future P-picture and is known as forward prediction. B-pictures are predicted from the closest earlier 
I or P picture and the closest later I or P picture. 

[0010] In the enhancement layer, a P-picture can be predicted from (a) the most recently decoded picture in the
enhancement layer, (b) the most recent base layer picture, in display order, or (c) the next lower layer picture, in display
order. Case (b) is usually used when the most recent base layer picture, in display order, is an I-picture.
[0011] Moreover, a B-picture in the enhancement layer can be predicted using (d) the most recent decoded enhancement
layer picture for forward prediction, and the most recent lower layer picture, in display order, (e) the most recent
decoded enhancement layer picture for forward prediction, and the next lower layer picture, in display order, for backward
prediction, or (f) the most recent lower layer picture, in display order, for forward prediction, and the next lower
layer picture, in display order, for backward prediction. When the most recent lower layer picture, in display order, is an
I-picture, only that I-picture will be used for predictive coding (e.g., there will be no forward prediction).
[0012] Note that only prediction modes (a), (b) and (d) are encompassed within the MPEG MVP system. The MVP 
system is a subset of MPEG temporal scalability coding, which encompasses each of modes (a)-(f). 

[0013] In one optional configuration, the enhancement layer has only P and B pictures, but no I pictures. The reference
to a future picture (i.e., one that has not yet been displayed) is called backward prediction. Note that no backward
prediction occurs within the enhancement layer. Accordingly, enhancement layer pictures are transmitted in display
order. There are situations where backward prediction is very useful in increasing the compression rate. For example,
in a scene in which a door opens, the current picture may predict what is behind the door based upon a future picture
in which the door is already open.

[0014] B-pictures yield the most compression but also incorporate the most error. To eliminate error propagation,
B-pictures may never be predicted from other B-pictures in the base layer. P-pictures yield less error and less compression.
I-pictures yield the least compression, but are able to provide random access.

[0015] For disparity prediction, e.g., where a lower layer image is used as a reference image for an enhancement layer
image, either alone or in combination with an enhancement layer reference image, the enhancement layer image is
motion compensated by finding a best-match image in the reference image by searching a predefined search area, then
differentially encoding the pixels of the enhancement layer image using the pixels of the best-match image of the reference
image. A motion vector which defines the relative displacement of the best match image to the coded enhancement
layer region is transmitted with the differentially encoded pixel data to allow reconstruction of the enhancement
layer image at a decoder. Processing may occur on a macroblock by macroblock basis.

[0016] However, the processing and memory storage requirements for disparity prediction are increased when the
motion vector search range is increased. Additionally, inefficient variable length coding (e.g., Huffman coding) of disparity
vectors results. This results in more expensive and/or slower encoding and decoding apparatus. Accordingly, it would
be advantageous to have a system to improve the coding efficiency of disparity predicted enhancement layer images in
a stereoscopic video system. The system should account for the inter-ocular separation of a stereoscopic video camera
to provide a shifted lower layer image which more closely matches the enhancement layer image. The system should
be compatible with various image sizes, including rectangular as well as arbitrarily shaped images.
[0017] The system should further be compatible with various existing and proposed video coding standards, such as
MPEG-1, MPEG-2, MPEG-4, H.261 and H.263.

[0018] The system should provide for the transmission of an offset value for use by a decoder in reconstructing a reference
frame. The system should also be effective with video standards that do not allow for the transmission of an offset
value by reducing the motion vector search range at an encoder. The technique should be suitable for both still
images and sequences of images.

[0019] The present invention provides a system having the above and other advantages.






SUMMARY OF THE INVENTION 



[0020] In accordance with the present invention, a method and apparatus are presented for improving coding effi- 
ciency in a stereoscopic video transmission system by compensating for inter-ocular camera lens separation. 

[0021] A method for prediction of an enhancement layer image in an enhancement layer of a stereoscopic video signal
using a lower layer image in a lower layer thereof comprises the steps of determining an optimal offset, x, between the
enhancement layer image and the lower layer image according to either a minimum mean error, or a minimum mean
squared error, and shifting the lower layer image according to the optimal offset to obtain a reference image for use in
disparity predicting the enhancement layer image. The shifting is accomplished by deleting the last (e.g., rightmost) x
pixel columns of the lower layer image and padding the first (e.g., leftmost) x pixel columns according to the pre-existing
first pixel column (i.e., the leftmost column before shifting).

[0022] The enhancement layer image is disparity predicted from the reference image using motion compensation, and 
a best-match image, such as a macroblock, is obtained in the reference image using a search range which is reduced 
relative to a search range of the lower layer image without the shifting.

[0023] An estimated offset may be determined according to a camera focus parameter and an inter-ocular separation, 
in which case the lower layer image can be searched in a range determined by the estimated offset to find the optimal 
offset.

[0024] The enhancement layer image and the lower layer image may comprise video object planes or other arbitrarily
shaped images as well as rectangular images (e.g., frames). 

[0025] A new optimal offset x may be determined when a scene change is detected for the lower layer image. If a 
scene change is not detected, an offset from a prior image in the lower layer can be used as the optimal offset x. Option- 
ally, a new optimal offset x may be determined for a new group of pictures in the lower layer. 
[0026] The optimal offset x may be transmitted in the stereoscopic video signal for use by a decoder in recreating the
reference image. 

[0027] For the minimum mean error, the optimal offset x is determined such that the value 

$$\mathrm{Dist\_L1}(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left| y_L(i+x,\,j) - y_E(i,\,j) \right|$$

is minimized, where y_L and y_E represent luminance pixel values of the lower and enhancement layer images, respectively,
i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and enhancement layer
images, h is the height of the lower layer image, w is the width of the lower layer image, the lower layer image is a
left-view image and the enhancement layer image is a right-view image.

[0028] For the minimum mean squared error, the optimal offset x is determined such that the value 



$$\mathrm{Dist\_L2}(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left( y_L(i+x,\,j) - y_E(i,\,j) \right)^2$$

is minimized.

[0029] The offset for chrominance data is ⌊x/2⌋ for 4:2:0 video.
[0030] A corresponding apparatus and decoder are also presented. 

BRIEF DESCRIPTION OF THE DRAWINGS 



[0031]

FIG. 1 is a block diagram of a coder/decoder structure for stereoscopic video.
FIG. 2 is a schematic diagram of a stereoscopic video camera model.
FIG. 3 is an illustration of a disparity prediction mode for P-pictures in the enhancement layer.
FIG. 4 is an illustration of an enhancement layer predict mode for B-pictures.
FIG. 5 illustrates processing of a left-view picture in accordance with the present invention.
FIG. 6 illustrates an encoder process flow in accordance with the present invention.
FIG. 7 illustrates a decoder process flow in accordance with the present invention.
FIG. 8 illustrates disparity prediction and motion vector searching in accordance with the present invention.
FIG. 9 illustrates motion vector searching in accordance with the present invention.
FIG. 10 is a block diagram of an enhancement layer decoder structure in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION 



[0032] A method and apparatus are presented for estimating the optimal offset of a scene between right and left channel
views in a stereoscopic video system.

[0033] FIG. 1 is a block diagram of a coder/decoder structure for stereoscopic video. The MPEG MVP standard and 
similar systems involve coding of two video layers, including a lower layer and an enhancement or upper layer. For such 
an application, the lower layer is assigned to a left view while the enhancement layer is assigned to a right view. In the 
coder/decoder (e.g., codec) structure of FIG. 1, the lower layer and enhancement layer video sequences are received
by a temporal remultiplexer (remux) 105. Using time division multiplexing (TDMX), the enhancement layer video is provided
to an enhancement encoder 110, while the base layer video is provided to a lower encoder 115. Note that the
lower layer video data may be provided to the enhancement encoder 110 for disparity prediction.
[0034] The encoded enhancement and base layers are then provided to a system multiplexer 120 for transmission to
a decoder, shown generally at 122, as a transport stream. The transmission path is typically a satellite link to a cable
system headend or directly via satellite to a consumer's home. At the decoder 122, the transport stream is demultiplexed
at a system demultiplexer 125. The encoded enhancement layer data is provided to an enhancement decoder
130, while the encoded lower layer data is provided to a lower decoder 135. Note that decoding is preferably carried out
concurrently for the lower and enhancement layers in a parallel processing configuration. Alternatively, the enhancement
decoder 130 and lower decoder 135 may share common processing hardware, in which case decoding may be
carried out sequentially, one picture at a time.

[0035] The decoded lower layer data is output from the lower decoder 135 as a separate data stream, and is also
provided to a temporal remultiplexer 140. At the temporal remultiplexer 140, the decoded base layer data and the
decoded enhancement layer data are combined to provide an enhancement layer output signal as shown. The
enhancement and lower layer output signals are then provided to a display device for viewing.
[0036] FIG. 2 is a schematic diagram of a stereoscopic video camera model. The camera apparatus, shown generally
at 100, includes a right view lens 120 and a left view lens 110 with respective axes 125 and 115 which are separated
by an inter-ocular distance δ (130), typically 65 mm. The axes 115 and 125 intersect a camera plane 140. The camera
apparatus 100 includes two identical cameras, each with a respective lens, so that two separate recordings of a scene
are obtained. The cameras are oriented with parallel axes and coplanar image sensors, such as charge coupled
devices (CCDs). Thus, the displacement (disparity) of two images of a scene at a given moment is mainly horizontal
and is created by the horizontal separation of the lenses 110 and 120.

[0037] A stereoscopic imaging system replicates the principle of the human vision system to provide two views of a
scene. By presenting the appropriate views on a suitable display to the corresponding left and right eyes of a viewer,
two slightly different perspective views of the scene are imaged on each retina. The brain then fuses these images into 
one view, and the viewer experiences the sensation of stereopsis (stereoscopic vision), which provides added realism 
through improved depth perception. 

[0038] To efficiently transmit stereoscopic video data, coding (e.g., compression) of the images of the two views must
be efficient. Efficient coding of a stereoscopic video depends not only on motion compensation, but also on disparity
(e.g., cross-channel or cross-layer) prediction. By reducing a motion vector search range for disparity prediction
between left- and right-view pictures, a low complexity encoder can be implemented. This is achieved by optimally estimating
the global location-offset of a scene between pictures of two views at the same temporal reference point.

[0039] The system presented herein may be used as a performance enhancement option of the MPEG-2 Multi-View Profile
(MVP) and MPEG-4 Video Verification Model (VM) (Version 3.0 and above) experiments for disparity prediction of
stereoscopic video coding. MVP (or MPEG-4 VM 3.0) involves two layer coding, namely a lower or base layer and an
enhancement layer. For stereoscopic video coding, the lower layer is assigned to the left view and the enhancement
layer is assigned to the right view. The disparity estimation/prediction modes of the enhancement layer in MVP for P- and
B-pictures consist of a macroblock-based block matching technique. In an MVP decoder, these prediction modes
are shown in FIGs 3, 4 and 8.

[0040] With stereoscopic video coding, a horizontal disparity vector for each disparity-predicted macroblock is 
expected because of the offset of the view points. In fact, this causes inefficient variable length (Huffman) coding (VLC) 
of these disparity vectors. The present invention addresses the problem of how to determine the horizontal offset of 
stereoscopic views such that the coding of estimated disparity vectors becomes more efficient.

[0041] In accordance with the present invention, the left-view image is offset by an appropriate number of pixels such
that the displacement between the offset left-view image and the right-view image can be reduced. The disparity prediction
based on this new image pair is therefore more efficient.

[0042] FIG. 3 is an illustration of a disparity prediction mode for P-pictures in the enhancement layer. Here, a P-picture 
310 in the enhancement layer is disparity predicted using a temporally coincident I-picture 300 in the lower layer.

[0043] FIG. 4 is an illustration of an enhancement layer predict mode for B-pictures. Here, a B-picture 410 in the 
enhancement layer is predicted using both forward prediction and disparity prediction. Specifically, the B-picture 410 is
forward predicted using another B-picture 420, which is the most recent decoded enhancement layer picture, and an
I-picture 400, which is the most recent lower layer picture, in display order.
[0044] FIG. 5 illustrates processing of a left-view picture in accordance with the present invention. A global horizontal
position offset technique of the present invention improves coding efficiency while maintaining compatibility with exist- 
ing stereoscopic coding standards. The global horizontal position offset method obtains a horizontal position shift of the 
left-view image such that the distortion between the (shifted) left-view image and the corresponding right-view image is 
minimized. This technique is applicable to arbitrarily shaped images such as Video Object Planes (VOP) as discussed 
in the MPEG-4 standard as well as rectangular images, e.g., a video frame or picture or sub-portion thereof as used in
the MPEG-2 MVP standard. Specifically, a VOP in a left-view image is shifted to the right by deleting the x leftmost pix- 
els which extend vertically on the VOP, i.e., at the leftmost edge of the VOP, and padding x pixels starting at the right- 
most edge of the VOP. Thus, the rightmost edge is extended horizontally by x pixels. The position of the VOP is thus 
shifted with respect to the left-view frame in which it is situated as well as with respect to the corresponding VOP in the
right-view image. Generally, the rightmost and leftmost portions of the left-view frame are unchanged, assuming the 
VOP does not extend to the vertical boundaries of the frame. 

[0045] In FIG. 5, a left-view image 500 and right-view image 510 are shown. Parameters h and w denote the height and
width, respectively, for both images. For example, for NTSC video, h=480 and w=704, and for PAL video, h=576 and
w=704. Parameters y_L(i,j) and y_R(i,j) represent the luminance pixel values of the left- (or lower) and right-view images,
respectively. The parameter y_R(i,j) may be referred to as y_E(i,j), where the subscript "E" denotes the enhancement layer.
[0046] The technique is discussed assuming the left-view image is in the lower layer and the right-view image is in 
the enhancement layer. However, the technique is easily adapted for use in a stereoscopic video system where the 
right-view image is in the lower layer and the left-view image is in the enhancement layer.

[0047] The left-view image 500 includes a feature 505, while the right-view image 510 includes the same feature 515
but in a different relative position within the frame. Specifically, the image 500 is relatively offset to the left of the image
510 by a distance x. In a first step, the value x is the horizontal offset which is to be determined, and is assumed to fall
within a pre-assigned or pre-determined range X, that is, 0 ≤ x ≤ X.
[0048] The global horizontal position offset technique in accordance with a first embodiment of the present invention
is to find the horizontal offset integer value x such that: 

$$\mathrm{Dist\_L2}(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left( y_L(i+x,\,j) - y_E(i,\,j) \right)^2$$



is minimized, where y_L and y_E represent the luminance pixel values of the lower and enhancement layer images,
respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and enhancement layer
images, h is the height of each image, and w is the width of each image. This technique uses a minimum mean
squared error between pixel values of the enhancement and lower layer images. Note that h(w-x) denotes multiplication,
not a function of h. An exhaustive search is performed horizontally for 0 ≤ x ≤ X to find the offset x such that
Dist_L2(x) is a minimum.

[0049] In another embodiment of the present invention, the offset value x is found such that:

$$\mathrm{Dist\_L1}(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left| y_L(i+x,\,j) - y_E(i,\,j) \right|$$

is a minimum. This technique, which uses a minimum mean error between pixel values of the enhancement and lower
layer images, can be implemented with reduced computational requirements. 
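
The two distortion measures and the exhaustive search can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the invention; the names `distortion` and `find_offset`, and the row-major list-of-rows image layout (`image[j][i]`), are assumptions made for the example. The indexing follows the formulas exactly as written above, i.e. y_L(i+x, j) is compared with y_E(i, j).

```python
from typing import Sequence

Image = Sequence[Sequence[int]]  # image[j][i]: row j (vertical), column i (horizontal)


def distortion(lower: Image, enh: Image, x: int, squared: bool) -> float:
    """Mean absolute (Dist_L1) or mean squared (Dist_L2) luminance error at offset x.

    The sum runs over i = 0..w-x-1 and j = 0..h-1 and is normalised by the
    product h * (w - x), as in the formulas of the text.
    """
    h = len(lower)
    w = len(lower[0])
    total = 0
    for j in range(h):
        for i in range(w - x):
            diff = lower[j][i + x] - enh[j][i]
            total += diff * diff if squared else abs(diff)
    return total / (h * (w - x))


def find_offset(lower: Image, enh: Image, search_range: int, squared: bool = True) -> int:
    """Exhaustive horizontal search over 0 <= x <= search_range (hypothetical helper).

    The range is clamped to w - 1 so that at least one column is always compared.
    Ties are resolved in favour of the smallest offset.
    """
    w = len(lower[0])
    candidates = range(0, min(search_range, w - 1) + 1)
    return min(candidates, key=lambda x: distortion(lower, enh, x, squared))
```

The absolute-error variant avoids one multiplication per pixel, which is one plausible reading of the reduced computational requirements mentioned above.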

[0050] In another embodiment of the present invention, a horizontal offset x_est is estimated by using a camera focus
parameter and the inter-ocular separation δ. For example, an estimated offset of ten pixels (e.g., +/-5) may be used.
Then, an exhaustive horizontal search is performed for max{x_est - 5, 0} ≤ x ≤ {x_est + 5} to find the offset x such that
Dist_L1(x) or Dist_L2(x) is a minimum.
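
The restricted search window can be illustrated with a short sketch. The function names and the `cost` callback are hypothetical; `cost` stands in for whichever distortion measure is used, and how the estimate itself is derived from the camera parameters is not specified here and is not modelled.

```python
from typing import Callable


def offset_search_window(x_est: int, margin: int = 5) -> range:
    """Candidate offsets max(x_est - margin, 0) .. x_est + margin, inclusive."""
    low = max(x_est - margin, 0)
    high = x_est + margin
    return range(low, high + 1)


def best_offset(cost: Callable[[int], float], x_est: int, margin: int = 5) -> int:
    """Pick the candidate in the window that minimises the supplied cost function."""
    return min(offset_search_window(x_est, margin), key=cost)
```

With an estimate of 4 and the default margin of 5, the lower bound is clamped to 0, so the candidates are 0 through 9.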

[0051] A left-view reference frame for disparity estimation and prediction is obtained as follows. After determining the 
horizontal offset x in the encoder, a reference frame is constructed from the original and reconstructed left-view images 
for disparity estimation/prediction of the right-view image. If the video standard allows the offset value x to be transmitted
to a decoder, the offset x is extracted at the decoder, and the reference frame is reconstructed from the decoded
left-view image for disparity prediction/compensation of the right-view image. The offset may be transmitted in the user 
data portion of a picture header, for example. 

[0052] The construction process of the reference frame for luminance pixels is achieved, in a second step, by deleting 
the last x columns of the left-view image. At the encoder, the original left-view image is used, while at the decoder, the 
decoded left-view image is used. Referring to the left-view image 535, the last x columns 520 at the right-hand side of
the image 535 are deleted. 

[0053] In a third step, for each row of the left-view image 540, x pixels at the beginning of the row are filled with the first pixel
value of the row. The fill (e.g., padding) process can be accomplished as described in the MPEG-4 standard. The padded
region 530 is shown at the left-hand side of the image 540. As a result of the foregoing steps, an offset or shifted
left-view image 540 is obtained that more closely matches the corresponding right-view image.
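
The second and third steps for a rectangular luminance image can be sketched as below. The function name `shift_right_with_padding` is an assumption for illustration, and the simple "repeat the first pixel of the row" fill is the rule stated above rather than the full MPEG-4 padding procedure.

```python
from typing import List, Sequence


def shift_right_with_padding(image: Sequence[Sequence[int]], x: int) -> List[List[int]]:
    """Build the shifted reference image for a rectangular picture.

    For each row, the last x columns are deleted and x copies of the row's
    first pixel value are inserted at the beginning, so the width is unchanged.
    """
    width = len(image[0])
    if x < 0 or x >= width:
        raise ValueError("offset must satisfy 0 <= x < image width")
    return [[row[0]] * x + list(row[: width - x]) for row in image]
```

For example, shifting the row `[1, 2, 3, 4, 5]` by x = 2 drops `4, 5` and yields `[1, 1, 1, 2, 3]`.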

[0054] For the chrominance pixel data, the construction process of the reference frame for disparity prediction consists
of the same steps given, but with a horizontal offset of ⌊x/2⌋, that is, x/2 with rounding down to the next integer.
This assumes a 4:2:0 video format. The offset may be modified for other formats as required.
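
The rounding rule for the chrominance offset can be stated in one line of code. The helper name `chroma_offset` is hypothetical, and only the 4:2:0 case described above is covered.

```python
def chroma_offset(luma_offset: int) -> int:
    """Horizontal chrominance offset for 4:2:0 video: floor(x / 2)."""
    if luma_offset < 0:
        raise ValueError("offset must be non-negative")
    return luma_offset // 2  # integer division rounds down for non-negative x
```

A luminance offset of 5 pixels therefore corresponds to a chrominance offset of 2 pixels.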
[0055] FIG. 6 illustrates an encoder process flow in accordance with the present invention. The process shown corresponds
to the case where the horizontal offset value x can be transmitted to a decoder. For the case where the horizontal
offset cannot be transmitted, e.g., with the MPEG-2 MVP standard, the horizontal offset value x can still be used
to reduce the complexity of disparity vector searching in the encoder, as discussed in connection with FIGs 8 and 9.

[0056] The offset value x may be determined according to various protocols. For example, x may be computed and
stored for each successive image in a video sequence. However, this may be computationally burdensome and unnecessary.
Alternatively, the offset x may be determined whenever a scene change is detected, or at the start of a new
group of pictures (GOP). A group of pictures (GOP) indicates one or more consecutive pictures which can be decoded
without reference to pictures in another GOP. The selection of an optimum criterion for recalculating the offset x should
be based on implementation complexity and video characteristics.

[0057] If the offset x is not newly recalculated for the current image, the previous stored offset can be used. 
[0058] The left-view image is provided to a block 610, where it is determined whether a scene change or a new GOP 
is detected. If so, at block 620, the offset search range X (where 0 ≤ x ≤ X) is loaded, e.g., into memory for use by a
microcomputer. If not, at block 600, the horizontal offset x which was determined from the last scene is used. 
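
The decision at block 610 can be sketched as a small helper. The name `select_offset` and the `estimate` callback are assumptions for illustration; `estimate` stands in for the offset search of block 630, and scene-change detection itself is outside the scope of the sketch.

```python
from typing import Callable, Optional


def select_offset(
    previous: Optional[int],
    scene_change: bool,
    new_gop: bool,
    estimate: Callable[[], int],
) -> int:
    """Return the offset to use for the current picture.

    The offset is re-estimated on a scene change, at the start of a new GOP,
    or when no earlier offset has been stored; otherwise the stored offset
    from the previous picture is reused.
    """
    if scene_change or new_gop or previous is None:
        return estimate()
    return previous
```

Reusing the stored value means the comparatively expensive search runs only when the scene geometry is likely to have changed.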

[0059] At block 630, the offset x is determined using either the minimum mean error or the minimum mean squared
error discussed previously. The right-view image data is used for this procedure. At block 640, the reference frame is 
constructed using the procedure discussed in connection with FIG. 5. The right-view image data is also used for this 
procedure. 

[0060] At block 650, the newly-constructed reference frame is searched to determine best-match macroblocks. That
is, a search range is defined in the reference frame over which each macroblock is compared to a right-view macroblock
which is currently being coded to determine the one reference frame macroblock which most closely matches the
right-view macroblock which is currently being coded. Since the reference frame is offset relative to the original left-view
image, it more closely resembles the right-view image, and a reduced search range may be used to obtain the best
match macroblock. For example, as discussed in connection with FIG. 9 below, the search range may be reduced from
64x48 pixels to 8x8 pixels.
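
A block-matching search over a configurable range can be sketched as follows. This is an illustrative full search using the sum of absolute differences (SAD) as the matching criterion; the criterion, the function names, and the interpretation of the range as a symmetric displacement of ±`range_x`, ±`range_y` around the block position are assumptions, not details taken from the text.

```python
from typing import Optional, Sequence, Tuple

Image = Sequence[Sequence[int]]  # image[j][i]: row j, column i


def block_sad(ref: Image, cur: Image, rx: int, ry: int, cx: int, cy: int, n: int) -> int:
    """Sum of absolute differences between an n x n block of ref and of cur."""
    return sum(
        abs(ref[ry + b][rx + a] - cur[cy + b][cx + a])
        for b in range(n)
        for a in range(n)
    )


def best_match(
    ref: Image, cur: Image, cx: int, cy: int, n: int, range_x: int, range_y: int
) -> Optional[Tuple[int, int, int]]:
    """Full search for the block of cur at (cx, cy); returns (vx, vy, sad).

    Candidates that fall partly outside the reference image are skipped.
    """
    h, w = len(ref), len(ref[0])
    best: Optional[Tuple[int, int, int]] = None
    for vy in range(-range_y, range_y + 1):
        for vx in range(-range_x, range_x + 1):
            rx, ry = cx + vx, cy + vy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue
            cost = block_sad(ref, cur, rx, ry, cx, cy, n)
            if best is None or cost < best[2]:
                best = (vx, vy, cost)
    return best
```

The number of candidates examined grows with the area of the search window, which is why a reference frame that already compensates for the global offset allows a much smaller window.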

[0061] At block 660, the right-view image is encoded using known techniques, such as those disclosed in the MVP
standard. At block 670, the encoded data and the offset x are transmitted to a decoder, e.g., in a satellite broadcast
CATV network, as discussed in connection with FIG. 7. Some video communication standards may not provide for the 
transmission of the offset value x, in which case the offset can be used only at the encoder to reduce the search range. 

[0062] FIG. 7 illustrates a decoder process flow in accordance with the present invention. In this case, the offset x is
assumed to be transmitted with the video data in a coded bitstream. At block 700, the horizontal offset is extracted from
the coded bitstream. At block 710, the left-view image is decoded in a conventional manner. At block 720, the reference
frame is constructed using the offset x. At block 730, the right-view image is disparity predicted using the encoded
right-view image data and the reference frame. The offset x and motion vectors are used to identify the best-match
macroblocks of the reference frame, and the full right-view image is recovered using the sum of the pixel data of the best-match
macroblocks and the differentially encoded right-view image data.
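The final summation step at block 730 can be sketched as below. This is a hedged illustration only: the function name `reconstruct_block`, the list-of-rows frame layout, and the convention that the vector gives the displacement of the reference block relative to the block being decoded are assumptions, not details taken from the source.

```python
def reconstruct_block(ref, residual, bx, by, vec):
    """Rebuild one right-view block as best-match reference pixels plus residual.

    ref      -- reference frame (rows of luminance values), already offset by x
    residual -- decoded differential data for the block (n x n)
    bx, by   -- top-left position of the block being decoded
    vec      -- (dx, dy) displacement of the best-match block in ref (assumed convention)
    """
    n = len(residual)
    rx, ry = bx + vec[0], by + vec[1]
    return [
        [ref[ry + j][rx + i] + residual[j][i] for i in range(n)]
        for j in range(n)
    ]


ref = [[10 * j + i for i in range(8)] for j in range(8)]
block = reconstruct_block(ref, [[1, -1], [0, 2]], 2, 3, (1, 0))
print(block)  # prints [[34, 33], [43, 46]]
```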

[0063] For cases where the horizontal offset cannot be transmitted, e.g., with the MPEG-2 MVP standard, the horizontal
offset can still be used to reduce the complexity of the disparity vector search in the encoder, e.g., by reducing
the motion vector search range.

[0064] FIG. 8 illustrates disparity prediction and motion vector searching in accordance with the present invention.
The enhancement layer includes a P-picture 810, a B-picture 820, and a B-picture 830, while the lower layer includes
an I-picture 840, a P-picture 850 and a P-picture 860. Prediction is indicated by the direction of the arrows such that the
arrow points from the reference image to the predicted image. For example, each macroblock in the P-picture 850 is
predicted using corresponding best-match macroblocks in the I-picture 840.

[0065] For each ith macroblock, a motion vector (v_x, v_y) indicates the relative displacement of the best-match
macroblock to the predicted macroblock. For lower layer prediction, the estimation is centered at a non-offset position of each
macroblock. For example, the upper left hand pixel of each predicted macroblock may be taken as the non-offset
coordinate (0,0).

[0066] The B-picture 820 is disparity predicted using the P-picture 850 in the lower layer and temporally predicted
using the P-picture 810 in the enhancement layer. For disparity prediction, the horizontal offset x is determined as
discussed. Next, macroblocks in the B-picture 820 are disparity predicted by locating best-match macroblocks in the
P-picture 850, where the disparity estimation/prediction is centered on (x,0) rather than (0,0). That is, the estimation is shifted
by x pixels to the right.
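A search centered on (x,0) instead of (0,0) can be sketched as follows. The example is illustrative only: the sum of absolute differences as the matching cost, the names `sad` and `best_match`, the `radius` parameter, and the convention of returning the displacement of the reference block relative to the current block are all assumptions for this sketch.

```python
def sad(ref, cur, rx, ry, cx, cy, n):
    """Sum of absolute differences between an n x n reference and current block."""
    return sum(
        abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
        for j in range(n)
        for i in range(n)
    )


def best_match(ref, cur, cx, cy, n, center, radius):
    """Search +/- radius pixels around `center` for the lowest-cost block."""
    h, w = len(ref), len(ref[0])
    best_cost, best_vec = None, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            vx, vy = center[0] + dx, center[1] + dy
            rx, ry = cx + vx, cy + vy
            if not (0 <= rx <= w - n and 0 <= ry <= h - n):
                continue  # candidate block falls outside the reference frame
            cost = sad(ref, cur, rx, ry, cx, cy, n)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (vx, vy)
    return best_vec, best_cost


# Synthetic frames: the current view is the reference displaced by 5 columns.
ref = [[i * i + 7 * j * j + i * j for i in range(32)] for j in range(16)]
cur = [[row[min(i + 5, 31)] for i in range(32)] for row in ref]
print(best_match(ref, cur, 8, 4, 4, center=(4, 0), radius=2))  # prints ((5, 0), 0)
```

With the same small radius but a center of (0,0), the true match lies outside the window, which is the situation the offset is meant to avoid.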

[0067] The disparity vector (v_x, v_y) indicates the positional difference between corresponding macroblocks of pixels of
the base layer and the enhancement layer, and is used for reconstruction of the disparity-predicted enhancement layer
picture at a decoder. In particular, with the pixel coordinates for a search window macroblock in the enhancement layer
being (x_s, y_s), and the pixel coordinates for a corresponding reference window macroblock in the base layer being (x_r, y_r),
the disparity vector is v = (v_x, v_y) = (x_s - x_r, y_s - y_r). Thus, the disparity vector is a measure of a positional or translational
difference between the search window and the reference window. The disparity vectors may be transmitted in the
right-view channel data stream for use in reconstructing the disparity-predicted enhancement layer picture at a decoder.
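As a worked illustration of this definition, using hypothetical coordinates that are not taken from the source, a search window macroblock at (48, 32) matched to a reference window macroblock at (53, 32) gives:

```latex
(x_s, y_s) = (48, 32), \qquad (x_r, y_r) = (53, 32)
\;\Longrightarrow\;
v = (v_x, v_y) = (x_s - x_r,\; y_s - y_r) = (48 - 53,\; 32 - 32) = (-5,\, 0)
```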
[0068] Moreover, the temporal prediction of the B-picture 820 using the P-picture 810 is centered at (v_x, v_y) for each
ith macroblock.

[0069] The disparity prediction and motion vector searching process can be further understood with reference to FIG. 9.

[0070] FIG. 9 illustrates motion vector searching in accordance with the present invention. As discussed in connection
with FIG. 8, a vector (v_x, v_y) defines a best match macroblock 920 in the I-picture 840 for an ith macroblock 900 in the
P-picture 850. The vector indicates the amount of temporal movement of an image between the two pictures. A search
range 910 is used to find the best match macroblock 920. The search range may have a total size of 82x64 pixels,
corresponding to a variation of 64x48 for the 16x16 macroblock 900.

[0071] For disparity prediction of macroblocks in the B-picture 820 in the enhancement layer, the ith macroblock 930
is centered at (x,0), and is compared to macroblocks in a smaller search range 940, for example, having a total size of
24x24 pixels, corresponding to a variation of 8x8 for a 16x16 macroblock. The offset value x allows a smaller search
range to be used since the best-match macroblock for differentially encoding the macroblock 930 is likely to be in a
smaller neighborhood of pixels near macroblock 930. Accordingly, a faster processing time and reduced memory
requirements can be realized.

[0072] Additionally, when the offset value is transmitted to the decoders, more efficient variable length coding (e.g.,
Huffman coding) of disparity vectors results since each disparity vector is smaller, thereby reducing the amount of data
which must be transmitted.

[0073] A macroblock in the B-picture 820 which is co-sited with the macroblock 900 in the P-picture 850 can also use
a smaller search range in the P-picture 810 which is centered on the macroblock 920 defined by the vector (v_x, v_y). For
example, the motion vector search range for the right-view sequence can also be reduced to as low as an 8x8 variation.
This is true since the correlation between the B-picture 820 and the P-picture 810 is likely to be similar to the correlation
between the P-picture 850 and the I-picture 840.

[0074] FIG. 10 is a block diagram of an enhancement layer decoder structure in accordance with the present
invention. The decoder, shown generally at 130, includes an input terminal 1005 for receiving the compressed enhancement
layer data, and a transport level syntax parser 1010 for parsing the data. The parsed data is provided to a memory
manager 1030, which may comprise a central processing unit. The memory manager 1030 communicates with a memory
1020, which may comprise a dynamic random-access memory (DRAM), for example. The horizontal offset x may be
communicated with the enhancement layer data or otherwise provided in the stereoscopic video signal. A reference
frame is constructed using the decoded lower layer data and the offset x.

[0075] The memory manager 1030 also communicates with a decompression/prediction processor 1040, and
receives decoded lower level data via terminal 1050 which may be stored temporarily in the memory 1020 for
subsequent use by the processor 1040 in decoding the disparity-predicted enhancement layer pictures.

[0076] The decompression/prediction processor 1040 provides a variety of processing functions, such as error
detection and correction, motion vector decoding, inverse quantization, inverse discrete cosine transformation, Huffman
decoding and prediction calculations, for instance. After being processed by the decompression/prediction function
1040, decoded enhancement layer data is output by the memory manager. Alternatively, the decoded data may be
output directly from the decompression/prediction function 1040 via means not shown.

[0077] An analogous structure may be used for the lower layer. Moreover, the enhancement and lower layer decoders 
may share common hardware. For example, the memory 1020 and processor 1040 may be shared. 
[0078] Test results confirm that the view offset estimation technique of the present invention can effectively improve
coding efficiency for stereoscopic video signals. The offset estimation technique was implemented in an MPEG-2 MVP
program and run through the Class D video test sequences of ISO/IEC JTC1/SC29/WG11/MPEG-4 and some other
sequences. Examples of test results with an offset search range of X=20 pixels are shown in Table 1. The improvement
in coding efficiency over MVP in bits/frame ranges from 2.0 to 5.2%. PSNR indicates the peak signal-to-noise ratio. All
picture types are P-pictures.

TABLE 1

Sequence                          Quantization   PSNR   Total coded   Improvement    Right-view
(offset value x; frame no. n)     level Q               bits          (bits/frame)   bit rate

Tunnel (x=2; n=50th)              26             31     210,818       2%             3 Mbits/sec.
Tunnel (x=2; n=50th)              33             30     172,011                      2 Mbits/sec.
Fun Fair (x=8; n=2nd)             26             31     223,939       2.3%           3 Mbits/sec.
Fun Fair (x=8; n=2nd)             33             30     181,071       5.2%           2 Mbits/sec.


[0079] Further coding efficiency improvements can be achieved by using a threshold T to zero the residual
macroblock after compensation, or zero some high frequency DCT coefficients.
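One way to read the thresholding step is sketched below. The source does not say how the residual is compared against T, so the sum of absolute residual values used here is an assumption, as are the function name `zero_small_residual` and the list-of-rows block layout.

```python
def zero_small_residual(residual, threshold):
    """Return an all-zero block if the residual magnitude is below the threshold.

    The magnitude measure (sum of absolute values) is an assumption;
    the source only states that a threshold T is used.
    """
    magnitude = sum(abs(v) for row in residual for v in row)
    if magnitude < threshold:
        return [[0] * len(row) for row in residual]
    return residual


print(zero_small_residual([[1, -1], [0, 1]], 4))  # prints [[0, 0], [0, 0]]
```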
[0080] As can be seen, the present invention provides a system for estimating the optimal offset x of a scene between
right and left channel views at the same temporal reference point. The system reduces the motion vector search range
for disparity (i.e., cross-channel or cross-layer) prediction to improve coding efficiency. The offset may be recalculated
when there is a scene change or a new group of pictures in the lower layer.

[0081] At an encoder, the optimal offset, x, between the enhancement layer image and the lower layer image is
determined according to either a minimum mean error between the enhancement and lower layer images, or a minimum
mean squared error between the enhancement and lower layer images. The offset x is bounded by an offset search
range X. The x rightmost pixel columns of the lower layer image are deleted, and the x leftmost columns of the lower
layer image are padded to effectively shift the lower layer image to the right by x pixels to obtain the reference image for
use in disparity predicting the enhancement layer image. For arbitrarily shaped images such as VOPs, a VOP in a
left-view image is shifted to the right by deleting the x leftmost pixels which extend vertically on the VOP, and padding x
pixels starting at the rightmost edge of the VOP.
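For a rectangular image, the delete-and-pad shift can be sketched as follows. This passage does not state which pixel values are used for padding, so repeating each row's leftmost pixel is an assumption made for this example, as is the function name `shift_right`.

```python
def shift_right(image, x):
    """Shift a rectangular image right by x columns.

    The x rightmost columns are dropped and x columns are padded on the
    left. Padding by repeating the leftmost pixel of each row is an
    assumption for illustration.
    """
    if x == 0:
        return [row[:] for row in image]
    return [[row[0]] * x + row[:len(row) - x] for row in image]


print(shift_right([[1, 2, 3, 4], [5, 6, 7, 8]], 1))  # prints [[1, 1, 2, 3], [5, 5, 6, 7]]
```

The output keeps the original width, so the shifted image can serve directly as a reference frame of the same dimensions.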

[0082] The reference frame is searched to obtain best-match macroblocks, and the right-view data is differentially
encoded. At a decoder, the offset value x is recovered if available and used to reconstruct the reference frame for
disparity prediction.

[0083] Although the invention has been described in connection with various specific embodiments, those skilled in
the art will appreciate that numerous adaptations and modifications may be made thereto without departing from the
spirit and scope of the invention as set forth in the claims.

Claims 

1. A method for predicting an enhancement layer image in an enhancement layer of a stereoscopic video signal using
a lower layer image in a lower layer thereof, comprising the steps of:

determining an optimal offset, x, between said enhancement layer image and said lower layer image according
to one of (a) a minimum mean error between pixel values of said enhancement layer image and said lower
layer image, and (b) a minimum mean squared error between pixel values of said enhancement layer image
and said lower layer image; and

shifting said lower layer image according to said optimal offset to obtain a reference image for use in disparity
predicting the enhancement layer image.

2. The method of claim 1, wherein:

the enhancement layer image is disparity predicted from said reference image using motion compensation;
and

a best-match image is obtained in said reference image using a search range which is reduced relative to a
search range of said lower layer image without said shifting.

3. The method of claim 1 or 2, comprising the further steps of:

determining an estimated offset according to at least one of a camera focus parameter and an inter-ocular
separation; and

searching within said lower layer image in a range determined by said estimated offset to find said optimal
offset.

4. The method of one of the preceding claims, comprising the further step of:

searching within a horizontal offset range X to find said optimal offset x such that 0 ≤ x ≤ X.

5. The method of one of the preceding claims, wherein:

said enhancement layer image and said lower layer image comprise a video object plane.

6. The method of one of the preceding claims, wherein:

said enhancement layer image and said lower layer image are arbitrarily shaped.

7. The method of claim 6, wherein said shifting step comprises the steps of:

deleting a leftmost edge region of the VOP which has a width of x pixels; and

padding a rightmost edge portion of the VOP to extend the rightmost edge portion by a width of x pixels.

8. The method of one of the preceding claims, wherein said shifting step comprises the steps of:

deleting x rightmost pixel columns of the lower layer image; and

padding a leftmost portion of the lower layer image with x pixel columns.

9. The method of one of the preceding claims, comprising the further steps of:

determining a new optimal offset x when a scene change is detected for the lower layer image; and

if a scene change is not detected, using an offset from a prior image in said lower layer as said optimal offset x.

10. The method of one of the preceding claims, wherein:

a new optimal offset x is determined for a new group of pictures in the lower layer.

11. The method of one of the preceding claims, comprising the further step of:

transmitting said optimal offset x in said stereoscopic video signal for use by a decoder in recreating the
reference image.

12. The method of one of the preceding claims, wherein:

for said minimum mean error, said optimal offset x is determined such that the value

$$\mathrm{Dist\_L}_1(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left| y_L(i+x, j) - y_E(i, j) \right|$$

is minimized, where y_L and y_E represent luminance pixel values of the lower and enhancement layer images,
respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and
enhancement layer images, h is the height of the lower layer image, w is the width of the lower layer image, said lower
layer image is a left-view image and said enhancement layer image is a right-view image.

13. The method of claim 12, wherein:

for said minimum mean error, an optimal offset for chrominance pixel values is ⌊x/2⌋.

14. The method of one of claims 1 to 11, wherein:

for said minimum mean squared error, said optimal offset x is determined such that the value

$$\mathrm{Dist\_L}_2(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left\{ y_L(i+x, j) - y_E(i, j) \right\}^2$$

is minimized, where y_L and y_E represent the luminance pixel values of the lower and enhancement layer images,
respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and enhancement
layer images, h is the height of the lower layer image, and w is the width of the lower layer image, said lower layer
image is a left-view image and said enhancement layer image is a right-view image.

15. The method of claim 14, wherein:

for said minimum mean squared error, an optimal offset for chrominance pixel values is ⌊x/2⌋.

16. An apparatus for predicting an enhancement layer image in an enhancement layer of a stereoscopic video signal
using a lower layer image in a lower layer thereof, comprising:

means for determining an optimal offset, x, between said enhancement layer image and said lower layer image
according to one of (a) a minimum mean error between pixel values of said enhancement layer image and said
lower layer image, and (b) a minimum mean squared error between pixel values of said enhancement layer
image and said lower layer image; and

means for shifting said lower layer image according to said optimal offset to obtain a reference image for use
in disparity predicting the enhancement layer image.

17. The apparatus of claim 16, wherein:

the enhancement layer image is disparity predicted from said reference image using motion compensation;
and

a best-match image is obtained in said reference image using a search range which is reduced relative to a
search range of said lower layer image without said shifting.

18. The apparatus of claim 16 or 17, further comprising:

means for determining an estimated offset according to at least one of a camera focus parameter and an
inter-ocular separation; and

means for searching within said lower layer image in a range determined by said estimated offset to find said
optimal offset.

19. The apparatus of one of claims 16 to 18, further comprising:

means for searching within a horizontal offset range X to find said optimal offset x such that 0 ≤ x ≤ X.

20. The apparatus of one of claims 16 to 19, wherein:

said enhancement layer image and said lower layer image comprise a video object plane.

21. The apparatus of one of claims 16 to 20, wherein:

said enhancement layer image and said lower layer image are arbitrarily shaped.

22. The apparatus of claim 21, wherein said means for shifting deletes a leftmost edge region of the VOP which has a
width of x pixels, and pads a rightmost edge portion of the VOP to extend the rightmost edge portion by a width of
x pixels.

23. The apparatus of one of claims 16 to 22, wherein said means for shifting deletes x rightmost pixel columns of the
lower layer image, and pads a leftmost portion of the lower layer image with x pixel columns.
24. The apparatus of one of claims 16 to 23, further comprising means for:

(a) determining a new optimal offset x when a scene change is detected for the lower layer image; and

(b) if a scene change is not detected, using an offset from a prior image in said lower layer as said optimal
offset x.

25. The apparatus of one of claims 16 to 24, wherein:

a new optimal offset x is determined for a new group of pictures in the lower layer.

26. The apparatus of one of claims 16 to 25, further comprising:

means for transmitting said optimal offset x in said stereoscopic video signal for use by a decoder in recreating
the reference image.

27. The apparatus of claim 16, wherein:

for said minimum mean error, said optimal offset x is determined such that the value

$$\mathrm{Dist\_L}_1(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left| y_L(i+x, j) - y_E(i, j) \right|$$

is minimized, where y_L and y_E represent luminance pixel values of the lower and enhancement layer images,
respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and enhancement
layer images, h is the height of the lower layer image, w is the width of the lower layer image, said lower layer image
is a left-view image and said enhancement layer image is a right-view image.

28. The apparatus of claim 27, wherein:

for said minimum mean error, an optimal offset for chrominance pixel values is ⌊x/2⌋.

29. The apparatus of one of claims 16 to 25, wherein:

for said minimum mean squared error, said optimal offset x is determined such that the value

$$\mathrm{Dist\_L}_2(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left\{ y_L(i+x, j) - y_E(i, j) \right\}^2$$

is minimized, where y_L and y_E represent the luminance pixel values of the lower and enhancement layer images,
respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and enhancement
layer images, h is the height of the lower layer image, and w is the width of the lower layer image, said lower layer
image is a left-view image and said enhancement layer image is a right-view image.

30. The apparatus of claim 29, wherein:

for said minimum mean squared error, an optimal offset for chrominance pixel values is ⌊x/2⌋.

31. A decoder for predicting an enhancement layer image in an enhancement layer of a stereoscopic video signal using
a lower layer image in a lower layer thereof, comprising:

means for recovering an optimal offset, x, between said enhancement layer image and said lower layer image
from said stereoscopic video signal;

said optimal offset x being determined at an encoder according to one of (a) a minimum mean error between
pixel values of said enhancement layer image and said lower layer image, and (b) a minimum mean squared
error between pixel values of said enhancement layer image and said lower layer image; and

means for shifting said lower layer image according to said optimal offset to obtain a reference image for use
in disparity predicting the enhancement layer image.

32. The decoder of claim 31, wherein:

the enhancement layer image is disparity predicted from said reference image using motion compensation;
and

a best-match image is obtained in said reference image using a search range which is reduced relative to a
search range of said lower layer image without said shifting.

33. The decoder of claim 31 or 32, wherein:

said enhancement layer image and said lower layer image comprise a video object plane.

34. The decoder of one of claims 31 to 33, wherein:

said enhancement layer image and said lower layer image are arbitrarily shaped.

35. The decoder of claim 34, wherein said means for shifting deletes a leftmost edge region of the VOP which has a
width of x pixels, and pads a rightmost edge portion of the VOP to extend the rightmost edge portion by a width of
x pixels.

36. The decoder of one of claims 31 to 35, wherein said means for shifting deletes x rightmost pixel columns of the
lower layer image, and pads a leftmost portion of the lower layer image with x pixel columns.

37. The decoder of one of claims 31 to 36, wherein:

for said minimum mean error, said optimal offset x is determined such that the value

$$\mathrm{Dist\_L}_1(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left| y_L(i+x, j) - y_E(i, j) \right|$$

is minimized, where y_L and y_E represent luminance pixel values of the lower and enhancement layer images,
respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and
enhancement layer images, h is the height of the lower layer image, w is the width of the lower layer image, said lower
layer image is a left-view image and said enhancement layer image is a right-view image.

38. The decoder of claim 37, wherein:

for said minimum mean error, an optimal offset for chrominance pixel values is ⌊x/2⌋.

39. The decoder of one of claims 31 to 36, wherein:

for said minimum mean squared error, said optimal offset x is determined such that the value

$$\mathrm{Dist\_L}_2(x) = \frac{1}{h(w-x)} \sum_{i=0}^{w-x-1} \sum_{j=0}^{h-1} \left\{ y_L(i+x, j) - y_E(i, j) \right\}^2$$

is minimized, where y_L and y_E represent the luminance pixel values of the lower and enhancement layer
images, respectively, i and j are horizontal and vertical Cartesian coordinates, respectively, in the lower and
enhancement layer images, h is the height of the lower layer image, and w is the width of the lower layer image,
said lower layer image is a left-view image and said enhancement layer image is a right-view image.

40. The decoder of claim 39, wherein:

for said minimum mean squared error, an optimal offset for chrominance pixel values is ⌊x/2⌋.
